{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "synthetic_features_and_outliers.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": [
        "i5Ul3zf5QYvW",
        "jByCP8hDRZmM",
        "WvgxW0bUSC-c",
        "copyright-notice"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "copyright-notice",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "#### Copyright 2017 Google LLC."
      ]
    },
    {
      "metadata": {
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        },
        "id": "copyright-notice2",
        "colab_type": "code",
        "cellView": "both"
      },
      "cell_type": "code",
      "outputs": [],
      "execution_count": 0,
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "metadata": {
        "id": "4f3CKqFUqL2-",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " # Caract\u00e9ristiques synth\u00e9tiques et anomalies"
      ]
    },
    {
      "metadata": {
        "id": "jnKgkN5fHbGy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **Objectifs d'apprentissage\u00a0:**\n",
        "  * Cr\u00e9er une caract\u00e9ristique synth\u00e9tique qui est le ratio de deux autres caract\u00e9ristiques\n",
        "  * Utiliser cette nouvelle caract\u00e9ristique comme entr\u00e9e dans un mod\u00e8le de r\u00e9gression lin\u00e9aire\n",
        "  * Am\u00e9liorer l'efficacit\u00e9 du mod\u00e8le en identifiant et en \u00e9liminant les anomalies des donn\u00e9es d'entr\u00e9e"
      ]
    },
    {
      "metadata": {
        "id": "VOpLo5dcHbG0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Reprenons le mod\u00e8le utilis\u00e9 dans l'exercice \"Premiers pas avec TensorFlow\". \n",
        "\n",
        "Vous allez, tout d'abord, importer les donn\u00e9es sur l'immobilier en Californie dans un `DataFrame` *Pandas*\u00a0:"
      ]
    },
    {
      "metadata": {
        "id": "S8gm6BpqRRuh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Configuration"
      ]
    },
    {
      "metadata": {
        "id": "9D8GgUovHbG0",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "from __future__ import print_function\n",
        "\n",
        "import math\n",
        "\n",
        "from IPython import display\n",
        "from matplotlib import cm\n",
        "from matplotlib import gridspec\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import sklearn.metrics as metrics\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.data import Dataset\n",
        "\n",
        "tf.logging.set_verbosity(tf.logging.ERROR)\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "\n",
        "california_housing_dataframe = california_housing_dataframe.reindex(\n",
        "    np.random.permutation(california_housing_dataframe.index))\n",
        "california_housing_dataframe[\"median_house_value\"] /= 1000.0\n",
        "california_housing_dataframe"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "I6kNgrwCO_ms",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Vous allez ensuite configurer la fonction d'entr\u00e9e et la d\u00e9finir pour l'entra\u00eenement de mod\u00e8le\u00a0:"
      ]
    },
    {
      "metadata": {
        "id": "5RpTJER9XDub",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):\n",
        "    \"\"\"Trains a linear regression model of one feature.\n",
        "  \n",
        "    Args:\n",
        "      features: pandas DataFrame of features\n",
        "      targets: pandas DataFrame of targets\n",
        "      batch_size: Size of batches to be passed to the model\n",
        "      shuffle: True or False. Whether to shuffle the data.\n",
        "      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely\n",
        "    Returns:\n",
        "      Tuple of (features, labels) for next data batch\n",
        "    \"\"\"\n",
        "    \n",
        "    # Convert pandas data into a dict of np arrays.\n",
        "    features = {key:np.array(value) for key,value in dict(features).items()}                                           \n",
        " \n",
        "    # Construct a dataset, and configure batching/repeating.\n",
        "    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit\n",
        "    ds = ds.batch(batch_size).repeat(num_epochs)\n",
        "    \n",
        "    # Shuffle the data, if specified.\n",
        "    if shuffle:\n",
        "      ds = ds.shuffle(buffer_size=10000)\n",
        "    \n",
        "    # Return the next batch of data.\n",
        "    features, labels = ds.make_one_shot_iterator().get_next()\n",
        "    return features, labels"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "VgQPftrpHbG3",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_model(learning_rate, steps, batch_size, input_feature):\n",
        "  \"\"\"Trains a linear regression model.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    input_feature: A `string` specifying a column from `california_housing_dataframe`\n",
        "      to use as input feature.\n",
        "      \n",
        "  Returns:\n",
        "    A Pandas `DataFrame` containing targets and the corresponding predictions done\n",
        "    after training the model.\n",
        "  \"\"\"\n",
        "  \n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "\n",
        "  my_feature = input_feature\n",
        "  my_feature_data = california_housing_dataframe[[my_feature]].astype('float32')\n",
        "  my_label = \"median_house_value\"\n",
        "  targets = california_housing_dataframe[my_label].astype('float32')\n",
        "\n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)\n",
        "  \n",
        "  # Create feature columns.\n",
        "  feature_columns = [tf.feature_column.numeric_column(my_feature)]\n",
        "    \n",
        "  # Create a linear regressor object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_regressor = tf.estimator.LinearRegressor(\n",
        "      feature_columns=feature_columns,\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "\n",
        "  # Set up to plot the state of our model's line each period.\n",
        "  plt.figure(figsize=(15, 6))\n",
        "  plt.subplot(1, 2, 1)\n",
        "  plt.title(\"Learned Line by Period\")\n",
        "  plt.ylabel(my_label)\n",
        "  plt.xlabel(my_feature)\n",
        "  sample = california_housing_dataframe.sample(n=300)\n",
        "  plt.scatter(sample[my_feature], sample[my_label])\n",
        "  colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]\n",
        "\n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"RMSE (on training data):\")\n",
        "  root_mean_squared_errors = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_regressor.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period,\n",
        "    )\n",
        "    # Take a break and compute predictions.\n",
        "    predictions = linear_regressor.predict(input_fn=predict_training_input_fn)\n",
        "    predictions = np.array([item['predictions'][0] for item in predictions])\n",
        "    \n",
        "    # Compute loss.\n",
        "    root_mean_squared_error = math.sqrt(\n",
        "      metrics.mean_squared_error(predictions, targets))\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, root_mean_squared_error))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    root_mean_squared_errors.append(root_mean_squared_error)\n",
        "    # Finally, track the weights and biases over time.\n",
        "    # Apply some math to ensure that the data and line are plotted neatly.\n",
        "    y_extents = np.array([0, sample[my_label].max()])\n",
        "    \n",
        "    weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]\n",
        "    bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')\n",
        "    \n",
        "    x_extents = (y_extents - bias) / weight\n",
        "    x_extents = np.maximum(np.minimum(x_extents,\n",
        "                                      sample[my_feature].max()),\n",
        "                           sample[my_feature].min())\n",
        "    y_extents = weight * x_extents + bias\n",
        "    plt.plot(x_extents, y_extents, color=colors[period]) \n",
        "  print(\"Model training finished.\")\n",
        "\n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.subplot(1, 2, 2)\n",
        "  plt.ylabel('RMSE')\n",
        "  plt.xlabel('Periods')\n",
        "  plt.title(\"Root Mean Squared Error vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(root_mean_squared_errors)\n",
        "\n",
        "  # Create a table with calibration data.\n",
        "  calibration_data = pd.DataFrame()\n",
        "  calibration_data[\"predictions\"] = pd.Series(predictions)\n",
        "  calibration_data[\"targets\"] = pd.Series(targets)\n",
        "  display.display(calibration_data.describe())\n",
        "\n",
        "  print(\"Final RMSE (on training data): %0.2f\" % root_mean_squared_error)\n",
        "  \n",
        "  return calibration_data"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FJ6xUNVRm-do",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " ## T\u00e2che\u00a01\u00a0: Essayer une caract\u00e9ristique synth\u00e9tique\n",
        "\n",
        "Les caract\u00e9ristiques `total_rooms` et `population` comptabilisent les totaux d'un \u00eelot urbain donn\u00e9.\n",
        "\n",
        "Mais que se passe-t-il si un \u00eelot urbain est plus dens\u00e9ment peupl\u00e9 qu'un autre\u00a0? Vous pouvez \u00e9tudier la relation entre la densit\u00e9 de population et le prix m\u00e9dian des logements en cr\u00e9ant une caract\u00e9ristique synth\u00e9tique qui est un ratio de `total_rooms` et `population`.\n",
        "\n",
        "Dans la cellule ci-dessous, cr\u00e9ez une caract\u00e9ristique nomm\u00e9e `rooms_per_person` et utilisez-la comme `input_feature` pour `train_model()`.\n",
        "\n",
        "Quelles sont les meilleures performances qu'il est possible d'obtenir avec cette seule caract\u00e9ristique en r\u00e9glant le taux d'apprentissage\u00a0? (Plus les performances sont \u00e9lev\u00e9es, mieux la droite de r\u00e9gression devrait ajuster les donn\u00e9es et plus la valeur RMSE finale devrait \u00eatre basse.)"
      ]
    },
    {
      "metadata": {
        "id": "isONN2XK32Wo",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **REMARQUE\u00a0**: Vous jugerez peut-\u00eatre utile d'ajouter quelques cellules de code ci-dessous afin d'essayer diff\u00e9rents taux d'apprentissage et de comparer les r\u00e9sultats. Pour ajouter une nouvelle cellule de code, passez la souris sous le centre de cette cellule, puis cliquez sur **CODE**."
      ]
    },
    {
      "metadata": {
        "id": "5ihcVutnnu1D",
        "colab_type": "code",
        "slideshow": {
          "slide_type": "slide"
        },
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "test": {
            "output": "ignore",
            "timeout": 600
          }
        },
        "cellView": "both"
      },
      "source": [
        "#\n",
        "# YOUR CODE HERE\n",
        "#\n",
        "california_housing_dataframe[\"rooms_per_person\"] =\n",
        "\n",
        "calibration_data = train_model(\n",
        "    learning_rate=0.00005,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\"\n",
        ")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i5Ul3zf5QYvW",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Solution\n",
        "\n",
        "Cliquez ci-dessous pour afficher une solution."
      ]
    },
    {
      "metadata": {
        "id": "Leaz2oYMQcBf",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "california_housing_dataframe[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"total_rooms\"] / california_housing_dataframe[\"population\"])\n",
        "\n",
        "calibration_data = train_model(\n",
        "    learning_rate=0.05,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "ZjQrZ8mcHFiU",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " ## T\u00e2che\u00a02\u00a0: Identifier les anomalies\n",
        "\n",
        "Vous pouvez visualiser les performances du mod\u00e8le en cr\u00e9ant un diagramme de dispersion des pr\u00e9dictions par rapport aux valeurs cibles. Id\u00e9alement, elles doivent se trouver sur une ligne diagonale parfaitement corr\u00e9l\u00e9e.\n",
        "\n",
        "Utilisez la fonction `scatter()` de Pyplot pour cr\u00e9er un diagramme de ce type, en utilisant le mod\u00e8le \"pi\u00e8ces par personne\" que vous avez entra\u00een\u00e9 dans le cadre de la T\u00e2che\u00a01.\n",
        "\n",
        "Constatez-vous des anomalies\u00a0? Remontez jusqu'aux donn\u00e9es source en observant la r\u00e9partition des valeurs dans le mod\u00e8le `rooms_per_person`."
      ]
    },
    {
      "metadata": {
        "id": "P0BDOec4HbG_",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# YOUR CODE HERE"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "jByCP8hDRZmM",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Solution\n",
        "\n",
        "Cliquez ci-dessous pour afficher la solution."
      ]
    },
    {
      "metadata": {
        "id": "s0tiX2gdRe-S",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "plt.figure(figsize=(15, 6))\n",
        "plt.subplot(1, 2, 1)\n",
        "plt.scatter(calibration_data[\"predictions\"], calibration_data[\"targets\"])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "kMQD0Uq3RqTX",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Les donn\u00e9es de calibrage montrent un alignement de la plupart des points de dispersion. La ligne est pratiquement verticale (mais nous y reviendrons). Concentrez-vous, pour le moment, sur les points qui s'\u00e9cartent de la ligne. Vous remarquerez qu'ils sont relativement peu nombreux.\n",
        "\n",
        "Si vous repr\u00e9sentez graphiquement un histogramme du mod\u00e8le `rooms_per_person`, vous constatez que les donn\u00e9es d'entr\u00e9e pr\u00e9sentent quelques anomalies\u00a0:"
      ]
    },
    {
      "metadata": {
        "id": "POTM8C_ER1Oc",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "plt.subplot(1, 2, 2)\n",
        "_ = california_housing_dataframe[\"rooms_per_person\"].hist()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "9l0KYpBQu8ed",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## T\u00e2che\u00a03\u00a0: \u00c9liminer les anomalies\n",
        "\n",
        "Voyons s'il est possible d'am\u00e9liorer encore l'ajustement du mod\u00e8le en rempla\u00e7ant les valeurs aberrantes de `rooms_per_person` par des valeurs minimales ou maximales raisonnables.\n",
        "\n",
        "Voici un exemple qui illustre l'application d'une fonction \u00e0 une `Series` Pandas\u00a0:\n",
        "\n",
        "    clipped_feature = my_dataframe[\"my_feature_name\"].apply(lambda x: max(x, 0))\n",
        "\n",
        "L'\u00e9l\u00e9ment `clipped_feature` ci-dessous ne comportera aucune valeur inf\u00e9rieure \u00e0 `0`."
      ]
    },
    {
      "metadata": {
        "id": "rGxjRoYlHbHC",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# YOUR CODE HERE"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WvgxW0bUSC-c",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Solution\n",
        "\n",
        "Cliquez ci-dessous pour afficher la solution."
      ]
    },
    {
      "metadata": {
        "id": "8YGNjXPaSMPV",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " L'histogramme cr\u00e9\u00e9 au cours de la T\u00e2che\u00a02 fait appara\u00eetre que la majorit\u00e9 des valeurs sont inf\u00e9rieures \u00e0 `5`. Vous allez r\u00e9duire `rooms_per_person` \u00e0 5 et tracer un histogramme pour effectuer une nouvelle v\u00e9rification des r\u00e9sultats."
      ]
    },
    {
      "metadata": {
        "id": "9YyARz6gSR7Q",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "california_housing_dataframe[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"rooms_per_person\"]).apply(lambda x: min(x, 5))\n",
        "\n",
        "_ = california_housing_dataframe[\"rooms_per_person\"].hist()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "vO0e1p_aSgKA",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Pour v\u00e9rifier la r\u00e9ussite de cette op\u00e9ration, vous allez proc\u00e9der \u00e0 un nouvel apprentissage et imprimer \u00e0 nouveau les donn\u00e9es de calibrage\u00a0:"
      ]
    },
    {
      "metadata": {
        "id": "ZgSP2HKfSoOH",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "calibration_data = train_model(\n",
        "    learning_rate=0.05,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "gySE-UgfSony",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "_ = plt.scatter(calibration_data[\"predictions\"], calibration_data[\"targets\"])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    }
  ]
}
